🧠 LLM Inference · Specific · Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 29,430 posts in 67.0 ms

ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d · …

Blog #0192: How Tokens Talk to Each Other
📊 Model Serving Economics · matthewsinclair.medium.com · 3d · …

Prefix caching for LLM inference optimization
💾 Prompt Caching · bentoml.com · 2d · Hacker News · …

Systematic Analysis of CPU-Induced Slowdowns in Multi-GPU LLM Inference (Georgia Tech)
🏗️ LLM Infrastructure · semiengineering.com · 5d · …

What is inference engineering? Deepdive
🏗️ LLM Infrastructure · newsletter.pragmaticengineer.com · 2d · …

alexziskind1/llm-inference-calculator
🏗️ LLM Infrastructure · github.com · 1d · …

Speculative Decoding: Performance or Illusion?
📊 Model Serving Economics · specdecode-bench.github.io · 6d · Hacker News · …

What if AI doesn’t need more RAM but better math?
🔬 RaBitQ · adlrocha.substack.com · 4d · Substack · …

From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
💾 Prompt Caching · news.future-shock.ai · 4d · Hacker News · …

BioInfo/dendrite: Agent-native inference engine with O(1) fork latency for tree-structured reasoning
🕯️ Candle · github.com · 3d · Hacker News · …

Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d · …

Efficient Inference of Large Vision Language Models
📦 Batch Embeddings · arxiv.org · 2d · …

Multiple-Prediction-Powered Inference
📦 Batch Embeddings · arxiv.org · 2d · …

ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing
🔢 BitNet Inference · arxiv.org · 2d · …

Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
🧠 Inference Serving · arxiv.org · 2d · …

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 3d · …

Tucker Attention: A generalization of approximate attention mechanisms
🎯 Deep Work · arxiv.org · 1d · …

Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge
📱 Edge AI Optimization · arxiv.org · 1d · …

Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
✨ Gemini · arxiv.org · 3d · …

SliderQuant: Accurate Post-Training Quantization for LLMs
🏗️ LLM Infrastructure · arxiv.org · 6d · …